Inspired from the pipeline by Pierre-Luc Germain presented in course STA426 (UZH)

Introduction

Motivations

Skin cancer is a disease that afflicts about 3.3M Americans each year (about 1% of the population) and is by far the most prevalent type of cancer. In fact, more people are diagnosed with skin cancer each year in the U.S. than all other cancers combined. ((“Skin Cancer Facts & Statistics - the Skin Cancer Foundation” 2020))

A common type of skin cancer is called squamous cell carcinoma (SCC). This is the second most common form of skin cancer and afflicts roughly 1M Americans a year. The main causes of SCC are mutations arising from UV radiation. (“Skin Cancer Facts & Statistics - the Skin Cancer Foundation” 2020)

Actinik keratosis (AK) is the most common form of precancer that results from long-term exposure to UV rays. Over time, actinic keratoses may develop into squamous cell carcinoma. This condition affects more than 58 million americans. ((“Skin Cancer Facts & Statistics - the Skin Cancer Foundation” 2020))

Understanding the heterogeneity of skin cancer cells as wall as their cell type composition over disease development are key to find new therapeutic interventions (McFarland et al. (2020))

– optional if we find something. Particularly, gaining an understanding of the biology of Cancer stem cells (CSC) is important, as they lead to metastasis and resistance to therapies. (link to )

In the following report, we analyze single cell RNA data stemming from 2 patients, one with SCC and one with AK.

Data

In the following we will be looking at CITE-seq data gathered by Levesque’s group for two patients. For both patients we have a sample of their healthy skin and and then a sample of either cancer or precancer skin. More precisely, patient one has a squamous cell carcinoma sample and patient two an actinic keratosis (pre-maligant lesion) sample. CITE-seq data (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) is a single-cell sequencing method to not just sequence the transcriptome but also get information on the surface proteins of a cell. To this end surface proteins of interest are stained with antibody-derived tags (ADTs), which are then simultaneously sequenced with the transcriptome. The sequencing data is then analyzed with 10X genomics cell ranger pipeline. This pipeline will clean up the data a little bit, align it to a genome of interest and compute a counts matrix containing the number of transcripts of a specific gene in a specific cell. Additionally there’s also a step that takes this output and filters it to get rid of emtpy droplets. In the end we have for each patient and each sample type, a raw and a filtered count matrix. After comparing our pipeline run with both raw and filtered data, we decided to stick with the raw data since the quality control and the clustering seemed to look better. We also realized after some quality control steps, that there seemed to be a problem with ADT data, which is why we discarded it for our pipeline. The reason for the bad quality of ADT data according to 10x is probably either physically aggregated antibodies or/and that the ADT library may have been undersequenced.

Bioconductor

One of the tools we use quite extensively in this pipeline is Bionconductor. Bioconductor contains a pool of tools, which can be used for the analysis of highthrouput genomic data in R. Since we’re interested in analyzing Levesque’s data, we mainly use bioconductor’s tools specialized to process, analyze and visualize single-cell RNA-seq data.

Preprocessing

Loading necessary libraries

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(scran)
  library(scater)
  library(batchelor)
  library(scDblFinder)
  library(sctransform)
  library(muscat)
  library(SEtools)
  library(cowplot)
  library(BiocParallel)
  library(ComplexHeatmap)

  library(DropletUtils) # for read10xCounts
  library(Matrix) # to handle sparse matrix
  library(BiocNeighbors) # for kNN graph
  library(igraph) # for cluster_louvain

  library(devtools)
  library(easyGgplot2) # for pretty barplots

  library(knitr) # for markers table output
})
# set local path
set.seed(11)
local.path <- getwd()
setwd(local.path)

# use source() function to source the functions we want to execute
source("functions/functions_qc.R")
source("functions/functions_analysis.R")

Loading the data

For the following chunk, three options are available. The filtered data provided by 10X can be read and used directly. Otherwise, the same data from 10X are read but also preprocess using quality check. The third option is to load data that were previously preprocessed and saved.

## data can be downloaded from : http://imlspenticton.uzh.ch/dump/files_for_levesque.tar

# Pick your poison
option_selected = 3

# Decide if the code should be run using raw data or filtered data from 10X
# default is TRUE
use_raw_data = TRUE

# and let it begin
switch(option_selected,
       # Option 1: read data without doing any QC
         sces <- read_10x(use_raw_data), 
       # Option 2: read data, run QC and save clean data for later use via Option 3
       # n_cores: numbers of cores to use (parallelization used to speed up emptyDrops function), 
       # default is 4
       {
         n_cores = 4
         sces <- read_and_qc(n_cores, use_raw_data)
       },
       # Option 3: read clean data produced by previous run of Option 2
       # Warning! Option 2 should have been run at least once before trying Option 3
       sces <- read_previous_data(use_raw_data),
)
sces <- lapply(sces, split_data)

# only select the gene expression data, drop the antibody capture
sce.patient1_HS <- sces[[1]]
sce.patient1_SCC <- sces[[2]]
sce.patient2_HS <- sces[[3]]
sce.patient2_AK <- sces[[4]]

# add name to sce for output graph
attr(sce.patient1_HS, "name") <- "Patient1 HS"
attr(sce.patient1_SCC, "name") <- "Patient1 SCC"
attr(sce.patient2_HS, "name") <- "Patient2 HS"
attr(sce.patient2_AK, "name") <- "Patient2 AK"

#make sure every rowname has the format ENS_ID.Gene_Name
rownames(sce.patient1_HS) <- paste_rownames(sce.patient1_HS)
rownames(sce.patient1_SCC) <- paste_rownames(sce.patient1_SCC)
rownames(sce.patient2_HS) <- paste_rownames(sce.patient2_HS)
rownames(sce.patient2_AK) <- paste_rownames(sce.patient2_AK)

We look at both the raw and unfiltered data from the cell ranger output, to see what yields the best results. For both data sets, we first ran them through our quality control function. Since the filtered data from 10X already has all the empty droplets removed, this step was skipped for it, but not for the raw data. In the following section we will quickly go over what steps we took in our quality control function and how we made sure they were sensible.

The first step in our pipeline is to separate the antibody-derived tag (ADT) data from the remaining mRNA counts and clean them up. Ideally we’d like to identify the cells which failed to capture the ADTs and remove them. This step is usually done by removing cells with ADT counts less than or equal to half of the total number of tags, which leads to the loss of a huge number of cells in our case as shown in the summary output. We produced a plot showing the distribution of the number of detected ADTs and observed that most cells have negligeable ADT counts. The distribution of ADT count has a smaller peak at around 50 though. From this plot we decided to change the threshold at which cells will be discarded to 25, to not throw away as many cells. But even with this change it seemed like to many cells were to be discarded because of low quality ADT data. Therefore, we finally decided to ignore this part of the data completely.

In the next step, the empty droplets were removed. As mentioned before this only happens for the raw data. We ran the function emptyDrops() to do so, which tests if the counts of a cell are significantly different from the ambient mRNA pool. If it is not the case, then we’re most likely facing an empty droplet. After looking at some summary statistics of the output of this function, we realized that there were non-significant cell barcodes, which were bounded by the number of iterations. We solved this issue by increasing the number of iterations (to 100’000) in emptyDrops(). There’s also a plot showing the histogram of the p-values given by emptyDrops() which should be approximately uniform. It was luckily the case with our data. It is interesting to note that this part of the quality control pipeline is responsible for the most amount of discarded cells.

Next we removed the dead cells, e.g. the cells with high mitochondrial gene expression. The approach we used was the same as introduced in the lecture. Lastly, we also removed cells which had low library sizes and detection rates. As seen in class, we used the perCellQCMetrics() and isOutlier() functions to do so. We also produced a plot showing the log-fold-change of the discarded vs. kept cells which helps to investigate the possibility of an entire cell type being accidentally discarded because of the quality control. If that would be the case, we would observe an enrichment of a certain cell type in the discarded group, which can be identified by an increased expression of the corresponding marker genes. In the case of the raw data, which we ended up choosing, the plot showed that this didn’t seem to be the case.

Normalization

In order to normalize our data, we first tried Library-Size normalization, which assumes that there is no “imbalance” in the differentially expressed (DE) genes between any pair of cells (Robert Amezquita (2020)). After observing that the separation of the clusters wasn’t satisfying, we decided to use Variance Stabilizing Transformation (VST). VST helps in the sens that it manage to transform the data into a more gaussian-shaped distribution.

Reduction

The first step of the dimensinality reduction of our data was to select the 2’000 most variable genes because cell type is most likely influence by those variable gene more than the less variable ones. In addition to those highly variable genes (hvg), we decided to included the genes known to be present in certain cell (sub)types rather than others, even when those specific genes were not part of the 2’000 most variable ones. We ran principal component analysis (PCA) using the selected genes and kept the 20 first principal components (PCs) based on the explained variance to transform our data. Finally, we ran tSNE and UMAP on the modified data to get a 2D representation of the data.

# genes not found in our data : "WISP2" "SEPP1" "TYP1" 
# were removed from the list below
genes <- list(
  # classification: https://www.sciencedirect.com/science/article/pii/S0923181120301985?via%3Dihub#bib0050
  # KERATINOCYTE
  keratinocyte = c("DSC3","DSP","LGALS7"),
  keratinocyte_basal = c("KRT5","KRT14","COL17A1"),
  keratinocyte_suprabasal = c("KRT1","KRT10"),
  keratinocyte_differentiated = c("LOR", "SPINK5"),
  keratinocyte_ORS = c("KRT6B"),
  keratinocyte_channel = c("GJB2","GJB6","ATP1B1"),
  keratinocyte_sebaceous_gland = c("MGST1","FASN"),
  keratinocyte_sweat_gland  = c("DCD","AQP5"),
  # for dynamics, from https://www.sciencedirect.com/science/article/pii/S0092867420306723
  keratinocyte_cycling = c("MKI67"),
  # FIBROBLAST
  fibroblast  = c("DCN","LUM","MMP2"),
  # fibroblast subclassification  from: https://www.nature.com/articles/s42003-020-0922-4#
  fibroblast_secretory_reticular = c("SLPI","CTHRC1","MFAP5","TSPAN8"),
  fibroblast_proinflammatory = c("CCL19","APOE","CXCL2","CXCL3","EFEMP1"),
  fibroblast_secretory_papillary = c("APCDD1","ID1","WIF1","COL18A1","PTGDS"),
  fibroblast_mesenchymal = c("ASPN","POSTN","GPC3","TNN","SFRP1"),
  # PERICYTE & vSMC
  pericytevSMC  = c("ACTA2","TAGLN"), #vSMC : vascular smooth muscel cell
  pericytevSMC_pericyte  = c("RGS5"),
  pericytevSMC_vSMC  = c("MYL9","TPM2","RERGL"),
  # ENDOTHELIAL CELL
  endothelial  = c("PECAM1","SELE","CLDN5","VWF"),
  endothelial_lymphatic  = c("PROX1","LYVE1"),
  # MYELOID CELL
  myeloid  = c("HLA-DRA","FCER1G","TYROBP","AIF1"),
  myeloid_dendritic  = c("CD1C"),
  myeloid_langerhans  = c("CD207"),
  myeloid_macrophage = c("CD68","RNASE1","ITGAX"),
  # LYMPHOCYTE
  lymphocyte  = c("CD3D","CD3E","CD52","IL7R"),
  # MELANOCYTE
  melanocyte  = c("DCT","MLANA","PMEL"),
  # SCHWANN CELL
  schwann  = c("CDH19","NGFR"),
  # MITOTIC CELL
  mitotic  = c("UBE2C","PCNA")
)
  
# Known markers: all the gene markers listed above
known_markers <- c(genes$keratinocyte, genes$keratinocyte_basal, genes$keratinocyte_suprabasal, genes$keratinocyte_differentiated, genes$keratinocyte_ORS, genes$keratinocyte_channel, genes$keratinocyte_sebaceous_gland, genes$keratinocyte_sweat_gland, genes$keratinocyte_cycling, genes$fibroblast, genes$fibroblast_secretory_reticular, genes$fibroblast_proinflammatory, genes$fibroblast_secretory_papillary, genes$fibroblast_mesenchymal, genes$pericytevSMC, genes$pericytevSMC_pericyte, genes$pericytevSMC_vSMC, genes$endothelial, genes$endothelial_lymphatic, genes$myeloid, genes$myeloid_dendritic, genes$myeloid_langerhans, genes$myeloid_macrophage, genes$lymphocyte, genes$melanocyte, genes$schwann, genes$mitotic)


sce.patient1_HS <- norm_redu(sce.patient1_HS, 2000, known_markers)
sce.patient1_SCC <- norm_redu(sce.patient1_SCC, 2000, known_markers)
sce.patient2_HS <- norm_redu(sce.patient2_HS, 2000, known_markers)
sce.patient2_AK <- norm_redu(sce.patient2_AK, 2000, known_markers)

Clustering

In order to label each cell by its type in a meaningfull way, we used the low-dimensionality representation we just produced and build a KNN graph from it. Chosing \(k=30\) means that each cell is connected to its 30 closest cells. Once we have a graph, we can use the cluster_louvain function which is based on the modularity measure and a hierachical approach to find community structure of the data ((n.d.)). At this stage, the data are clustered based only on their structure, not on their specific markers (c.f. exemple for Patient1 HS below). To make the link from one cluster to a cell type, we aggregate the gene expression level of cells within a cluster to get a single value. This aggregation summarizes the relative expression of each known markers for each cluster. The next step is to bring clusters having the same expression profile under the same cell type label. There is most likely room for improvement when looking at the Patient2 AK case presented below. For instance clusters with no strong expression for any of the known markers could be left out or labelled as undefined. The assumption is that well defined cell outnumber the badly defined ones, giving meaningful statistical results in the end. Once the major cell types are identified, we took a look at the keratinocyte and fibroblast subtypes using the same steps but on those specific subsets.

results <- sc_cluster(sce.patient1_HS)

sce.patient1_HS <- results[[1]]
g.patient1_HS <- results[[2]]

results <- sc_cluster(sce.patient1_SCC)
sce.patient1_SCC <- results[[1]]
g.patient1_SCC <- results[[2]]

results <- sc_cluster(sce.patient2_HS)
sce.patient2_HS <- results[[1]]
g.patient2_HS <- results[[2]]

results <- sc_cluster(sce.patient2_AK)
sce.patient2_AK <- results[[1]]
g.patient2_AK <- results[[2]]

Cluster annotation

The markers found in the literature (Kim, Chung, and Kim (2020), Solé-Boldo (2020)) were used to separate the different cell types and their subtypes within them. A few markers didn’t appear at all in our data (WISP2 Secretory-reticular fibroblast), SEPP1 Macrophage), TYP1 Melanocyte)) so the corresponding cell types were selected using their remaining markers present. Since those markers found in the literature are known to correlate with specific cell types, their genes were added to the list of high variable genes used to produce a lower dimensional representation of the data, even when their expression didn’t vary enough to be selected as highly variable.

Types Subtypes Markers
Keratinocyte DSC3 DSP LGALS7
basal KRT5 KRT14 COL17A1
suprabasal KRT1 KRT10
differentiated LOR SPINK5
ORS KRT6B
channel GJB2 GJB6 ATP1B1
sebaceous gland MGST1 FASN
sweat gland DCD AQP5
cycling MKI67
differentiating KRT1
Fibroblast DCN LUM MMP2
secretory reticular SLPI CTHRC1 MFAP5 TSPAN8
proinflammatory CCL19 APOE CXCL2 CXCL3 EFEMP1
secretory papillary APCDD1 ID1 WIF1 COL18A1 PTGDS
mesenchymal ASPN POSTN GPC3 TNN SFRP1
Pericyte & vSMC ACTA2 TAGLN
pericyte RGS5
vSMC MYL9 TPM2 RERGL
Endothelial cell PECAM1 SELE CLDN5 VWF
lymphatic PROX1 LYVE1
Myeloid HLA-DRA FCER1G TYROBP AIF1
dendritic CD1C
langerhans CD207
macrophage CD68 RNASE1 ITGAX
Lymphocyte CD3D CD3E CD52 IL7R
Melanocyte DCT MLANA PMEL
Schwann cell CDH19 NGFR
Mitotic cell UBE2C PCNA
genes_type <- list(keratinocyte = genes$keratinocyte, 
                   fibroblast = genes$fibroblast, 
                   pericytevSMC = genes$pericytevSMC, 
                   endothelial = genes$endothelial, 
                   myeloid = genes$myeloid, 
                   lymphocyte = genes$lymphocyte, 
                   melanocyte = genes$melanocyte, 
                   schwann = genes$schwann, 
                   mitotic = genes$mitotic)
  
results <- annotate_cells(sce.patient1_HS, g.patient1_HS, genes_type)
sce.patient1_HS <- results[[1]]
km.patient1_HS <- results[[2]]

results <- annotate_cells(sce.patient1_SCC, g.patient1_SCC, genes_type)
sce.patient1_SCC <- results[[1]]
km.patient1_SCC <- results[[2]]

results <- annotate_cells(sce.patient2_HS, g.patient2_HS, genes_type)
sce.patient2_HS <- results[[1]]
km.patient2_HS <- results[[2]]

results <- annotate_cells(sce.patient2_AK, g.patient2_AK, genes_type)
sce.patient2_AK <- results[[1]]
km.patient2_AK <- results[[2]]

Pseudo-bulk aggregation

One of the interesting biological questions is how the cell subtype composition changes between healthy skin and diseased skin. For this, simply extracted the cell subtypes for each sample and compare the proportion of cells for each subtype. We used the pseudobulk workflow suggested in the lecture, which ‘bulks’ gene expression levels for each cluster.

A Fisher’s exact test helps us verify that for this particular patient and collection method, the proportion of certain subtypes really differs among conditions.

#Cell Subtype composition varies a lot. For Patient 1, it seems that their diseased skin (SCC) contains a far higher proportion of lymphocytes, endothelial cells and myleloid cells. Conversely, the proportion of keratinocytes is far lower. For Patient 2 with (AK), the lymphocytes is far higher, similar for fibroblasts and endothelial cells but lower for all other cell type when compared to healthy skin.

#Mitotic Cells and Keratinocytes are hard to differentiate in Patient 2 with AK. When it comes to mitotic cells, the marker genes used to quantify mitotic cells play a significant role. Using the suggested marker genes from Kim, Chung, and Kim (2020), we find mitotic cells in both healthy samples, but not in the diseased samples (AK + SCC). We experimented with different marker genes for the mitotic cells, adding some we found in Hsiao et al. (2020), but since our clustering algorithm didn’t find clusters different enough from each other, we are left with either labeling some cells as all mitotic or all keratinocytes for patient 2.

# Patient 1 HS
results <- pseudobulk(sce.patient1_HS, km.patient1_HS)

sce.patient1_HS <- results[[1]]
pb.patient1_HS <- results[[2]]
mat.patient1_HS <- results[[3]]
cl2.patient1_HS <- results[[4]]

#Patient 1 SCC
results <- pseudobulk(sce.patient1_SCC, km.patient1_SCC)

sce.patient1_SCC <- results[[1]]
pb.patient1_SCC <- results[[2]]
mat.patient1_SCC <- results[[3]]
cl2.patient1_SCC <- results[[4]]

#Patient 2 HS
results <- pseudobulk(sce.patient2_HS, km.patient2_HS)

sce.patient2_HS <- results[[1]]
pb.patient2_HS <- results[[2]]
mat.patient2_HS <- results[[3]]
cl2.patient2_HS <- results[[4]]

#Patient 2 AK
results <- pseudobulk(sce.patient2_AK, km.patient2_AK)

sce.patient2_AK <- results[[1]]
pb.patient2_AK <- results[[2]]
mat.patient2_AK <- results[[3]]
cl2.patient2_AK <- results[[4]]

# create barplots of cell type proportions
df.celltypes.patient1 <- rbind(df_barplot_celltypes(sce.patient1_HS), df_barplot_celltypes(sce.patient1_SCC))
df.celltypes.patient1$Type <- c(rep("HS", 8), rep("SCC", 8))

df.celltypes.patient2 <- rbind(df_barplot_celltypes(sce.patient2_HS), df_barplot_celltypes(sce.patient2_AK))
df.celltypes.patient2$Type <- c(rep("HS", 8), rep("AK", 8))

dynamic_barplot(df.celltypes.patient1, "Patient 1", T, "Proportion of cell types in", c(HS ='#999999', SCC = '#E69F00'))

dynamic_barplot(df.celltypes.patient2, "Patient 2", T, "Proportion of cell types in", c(HS ='#999999', AK = '#E69F00'))

# p-value calculation
fisher_test_subtypes(sce.patient1_HS, sce.patient1_SCC, 'keratinocyte')
## 
##  Fisher's Exact Test for Count Data
## 
## data:  subtype_matrix
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.150620 2.760102
## sample estimates:
## odds ratio 
##   2.435241
fisher_test_subtypes(sce.patient1_HS, sce.patient1_SCC, 'lymphocyte')
## 
##  Fisher's Exact Test for Count Data
## 
## data:  subtype_matrix
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.2203613 0.3059767
## sample estimates:
## odds ratio 
##  0.2599188
fisher_test_subtypes(sce.patient1_HS, sce.patient1_SCC, 'fibroblast')
## 
##  Fisher's Exact Test for Count Data
## 
## data:  subtype_matrix
## p-value = 3.54e-09
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.3784739 0.6217369
## sample estimates:
## odds ratio 
##  0.4855149

Sub-population clustering

Keratinocytes subcomposition

Keratinocytes constitute 90% of epidermal skin cells. The epidermis is the outermost layer of the skin. They form a barrier against environmental damage by heat, UV radiation, water loss, various parasites such as viruses, fungi and pathogenic bacteria.

known_markers.kera <- c(genes$keratinocyte_basal, genes$keratinocyte_suprabasal, genes$keratinocyte_differentiated, genes$keratinocyte_ORS, genes$keratinocyte_channel, genes$keratinocyte_sebaceous_gland, genes$keratinocyte_sweat_gland)

# the following list is used to have prettier graph annotation
  genes.kera <- list(
    basal = genes$keratinocyte_basal,
    suprabasal = genes$keratinocyte_suprabasal,
    differentiated = genes$keratinocyte_differentiated,
    ORS = genes$keratinocyte_ORS,
    channel =  genes$keratinocyte_channel,
    sebaceous_gland = genes$keratinocyte_sebaceous_gland,
    sweat_gland  = genes$keratinocyte_sweat_gland
    )

sce.patient1_HS.kera <- cluster_subtype(sce.patient1_HS, "keratinocyte", known_markers.kera, genes.kera)

sce.patient1_SCC.kera <- cluster_subtype(sce.patient1_SCC, "keratinocyte", known_markers.kera, genes.kera)

sce.patient2_HS.kera <- cluster_subtype(sce.patient2_HS, "keratinocyte", known_markers.kera, genes.kera)

sce.patient2_AK.kera <- cluster_subtype(sce.patient2_AK, "keratinocyte", known_markers.kera, genes.kera)

Keratinocytes Dynamics

# the following list is used to have prettier graph annotation
genes.dyn <- list(
  # classification: https://www.sciencedirect.com/science/article/pii/S0092867420306723
  basal = genes$keratinocyte_basal[3], # we only select "COL17A1" marker to define the basal population
  cycling = genes$keratinocyte_cycling,
  differentiating = genes$keratinocyte_suprabasal[1] # we only select "KRT1"
  )

kera.dyn.patient1_HS <- dynamic_plot(sce.patient1_HS.kera, g.patient1_HS.kera, genes.dyn)
## class: SingleCellExperiment 
## dim: 17086 1372 
## metadata(2): Samples scDblFinder.stats
## assays(2): counts logcounts
## rownames(17086): ENSG00000241860.AL627309.5 ENSG00000237491.LINC01409
##   ... ENSG00000278817.AC007325.4 ENSG00000277196.AC007325.2
## rowData names(4): ID Symbol Type scDblFinder.selected
## colnames(1372): AAACCTGAGGGCTCTC-1 AAACCTGCAATGGAAT-1 ...
##   TTTGTCACAATCGGTT-1 TTTGTCAGTTCACCTC-1
## colData names(15): Sample Barcode ... cluster2 cluster3
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(1): Antibody Capture
kera.dyn.patient1_SCC <- dynamic_plot(sce.patient1_SCC.kera, g.patient1_SCC.kera, genes.dyn)
## class: SingleCellExperiment 
## dim: 15583 482 
## metadata(2): Samples scDblFinder.stats
## assays(2): counts logcounts
## rownames(15583): ENSG00000237491.LINC01409 ENSG00000228794.LINC01128
##   ... ENSG00000277666.AC136352.3 ENSG00000273554.AC136616.1
## rowData names(4): ID Symbol Type scDblFinder.selected
## colnames(482): AAACGGGGTCCGTGAC-1 AAACGGGGTCGGCACT-1 ...
##   TTTGCGCGTCGGCATC-1 TTTGGTTAGTCATGCT-1
## colData names(15): Sample Barcode ... cluster2 cluster3
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(1): Antibody Capture
kera.dyn.patient2_HS <- dynamic_plot(sce.patient2_HS.kera, g.patient2_HS.kera, genes.dyn)
## class: SingleCellExperiment 
## dim: 16321 2375 
## metadata(2): Samples scDblFinder.stats
## assays(2): counts logcounts
## rownames(16321): ENSG00000237491.LINC01409 ENSG00000228794.LINC01128
##   ... ENSG00000278817.AC007325.4 ENSG00000277196.AC007325.2
## rowData names(4): ID Symbol Type scDblFinder.selected
## colnames(2375): AAACCTGAGACATAAC-1 AAACGGGAGGAGCGAG-1 ...
##   TTTGTCATCAGTCAGT-1 TTTGTCATCTATCCCG-1
## colData names(15): Sample Barcode ... cluster2 cluster3
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(1): Antibody Capture
kera.dyn.patient2_AK <- dynamic_plot(sce.patient2_AK.kera, g.patient2_AK.kera, genes.dyn)
## class: SingleCellExperiment 
## dim: 15610 451 
## metadata(2): Samples scDblFinder.stats
## assays(2): counts logcounts
## rownames(15610): ENSG00000237491.LINC01409 ENSG00000228794.LINC01128
##   ... ENSG00000273554.AC136616.1 ENSG00000277196.AC007325.2
## rowData names(4): ID Symbol Type scDblFinder.selected
## colnames(451): AAAGTAGCACCGCTAG-1 AAATGCCTCGGAATCT-1 ...
##   TTTGGTTCATGAACCT-1 TTTGGTTGTGGTAACG-1
## colData names(15): Sample Barcode ... cluster2 cluster3
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(1): Antibody Capture
kera.dyn.patient1 <- rbind(kera.dyn.patient1_HS, kera.dyn.patient1_SCC)
kera.dyn.patient1$Type <- c(rep("HS", 3), rep("SCC", 3))
kera.dyn.patient2 <- rbind(kera.dyn.patient2_HS, kera.dyn.patient2_AK)
kera.dyn.patient2$Type <- c(rep("HS", 3), rep("AK", 3))


dynamic_barplot(kera.dyn.patient1, "Patient 1", F, "Proportion of Keratinocyte states in", c(HS ='#999999', SCC = '#E69F00'))

dynamic_barplot(kera.dyn.patient2, "Patient 2", F, "Proportion of Keratinocyte states in", c(HS ='#999999', AK = '#E69F00'))

Fibroblast subcomposition

known_markers.fibro <- c(genes$fibroblast_secretory_reticular, genes$fibroblast_proinflammatory, genes$fibroblast_secretory_papillary, genes$fibroblast_mesenchymal)

# the following list is used to have prettier graph annotation
genes.fibro <- list(
   # fibroblast subclassification  from: https://www.nature.com/articles/s42003-020-0922-4#
   secretory_reticular = genes$fibroblast_secretory_reticular,
   proinflammatory = genes$fibroblast_proinflammatory,
   secretory_papillary = genes$fibroblast_secretory_papillary,
   mesenchymal = genes$fibroblast_mesenchymal
)

sce.patient1_HS.fibro <- cluster_subtype(sce.patient1_HS, "fibroblast", known_markers.fibro, genes.fibro)

sce.patient1_SCC.fibro <- cluster_subtype(sce.patient1_SCC, "fibroblast", known_markers.fibro, genes.fibro)

sce.patient2_HS.fibro <- cluster_subtype(sce.patient2_HS, "fibroblast", known_markers.fibro, genes.fibro)

sce.patient2_AK.fibro <- cluster_subtype(sce.patient2_AK, "fibroblast", known_markers.fibro, genes.fibro)

Markers correspondance

The following code was used to make sure that the markers described in the literature were found in our dataset. It allowed to identify that WISP2, SEPP1 and TYP1 had to be discarded.

data_markers <- rowData(sce.patient1_HS)$Symbol
known_markers <- c("DSC3","DSP","LGALS7","KRT5","KRT14","COL17A1","KRT1","KRT10","LOR", "SPINK5","KRT6B","GJB2","GJB6","ATP1B1","MGST1","FASN","DCD","AQP5","DCN","LUM","MMP2","WISP2","SLPI","CTHRC1","MFAP5","TSPAN8","CCL19","APOE","CXCL2","CXCL3","EFEMP1","APCDD1","ID1","WIF1","COL18A1","PTGDS","ASPN","POSTN","GPC3","TNN","SFRP1","ACTA2","TAGLN","RGS5","MYL9","TPM2","RERGL","PECAM1","SELE","CLDN5","VWF","PROX1","LYVE1","HLA-DRA","FCER1G","TYROBP","AIF1","CD1C","CD207","CD68","RNASE1","SEPP1","ITGAX","CD3D","CD3E","CD52","IL7R","DCT","MLANA","PMEL","TYP1","CDH19","NGFR","UBE2C","PCNA")


idx_notfound <- which(!(known_markers %in% rowData(sce.patient1_HS)$Symbol))

print(paste("Done:", length(known_markers)-length(idx_notfound), "/", length(known_markers), "markers found"))
cat("Genes missing in dataset: ", known_markers[idx_notfound])

Summary

We started by cleaning up the raw data and applying clustering algorithm to identify the cell type populations mentioned above. The clustering of main cell type appeared to be quite satisfying whereas the cell subtypes were more challenging to separate. We observed differences in the cell (sub)type proportion in the different samples, in particular for lymphocytes and keratinocytes. The general trend showed an increase of lymphocytes and a decrease of keratinocytes in cancer tissue. We also took a look at the dynamics of the keratinocytes ADD DESCRIPTION ON RESULTS.

Discussion

The first point which should be discussed is related to the few relatively arbitrary cut-off values that were used. To be more specific, we decided to use the 20 first principal components (PCs) and to build our kNN graph linking a cell to its 30 closest neighbors. Both factors are influencing the results and can be discussed.

Secondly, we think the method used would benefit from a manual selection for cluster aggregation into a specific cell type. Comparing the heatmaps produced before and after aggregation showed that we sometimes faced clusters with no clear marker expression. The cells belonging to those cluster could therefore artificially blow up the proportion of a cell type after aggregation even though they are not expressing the required markers. We think that a manual pick would allow a more refine selection and the introduction of an additional label bringing all the undefined cells together.

An other point to discuss is the change in cell type proportion. We saw that changes were important for lymphocytes and keratinocytes between healthy and cancerous tissues, but it is important to keep in mind that we only have the data of two patients and therefore not a very statistically reliable result. Also as mentioned before, e.g. clusters that have no strong expression for any of the known markers we looked at, could have an effect on the number of cells counted towards a certain cell type. This could potentially give us a false impression on how different a cell type is in the healthy and cancer sample. As mentioned before this would be something to look into some more, but the assumption is that it won’t have that much of an effect.

ADD DISCUSSION ABOUT IMPLICATION IN THERAPIES

Conclusion

References

Hsiao, Chiaowen Joyce, PoYuan Tung, John D. Blischak, Jonathan E. Burnett, Kenneth A. Barr, Kushal K. Dey, Matthew Stephens, and Yoav Gilad. 2020. “Characterizing and Inferring Quantitative Cell Cycle Phase in Single-Cell RNA-Seq Data Analysis.” Genome Research 30 (4): 611–21. https://doi.org/10.1101/gr.247759.118.

Kim, Doyoung, Kyung Bae Chung, and Tae-Gyun Kim. 2020. “Application of Single-Cell Rna Sequencing on Human Skin: Technical Evolution and Challenges.” Journal of Dermatological Science 99 (2): 74–81. https://doi.org/https://doi.org/10.1016/j.jdermsci.2020.06.002.

McFarland, James M., Brenton R. Paolella, Allison Warren, Kathryn Geiger-Schuller, Tsukasa Shibue, Michael Rothberg, Olena Kuksenko, et al. 2020. “Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action.” Nat. Commun. 11 (4296): 1–15. https://doi.org/10.1038/s41467-020-17440-w.

Robert Amezquita, Stephanie Hicks, Aaron Lun. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” http://bioconductor.org/books/release/OSCA/.

“Skin Cancer Facts & Statistics - the Skin Cancer Foundation.” 2020. Skin Cancer Foundation. https://www.skincancer.org/skin-cancer-information/skin-cancer-facts.

Solé-Boldo, Raddatz, L. 2020. “Single-Cell Transcriptomes of the Human Skin Reveal Age-Related Loss of Fibroblast Priming.” Communications Biology 3 (188). https://doi.org/https://doi.org/10.1038/s42003-020-0922-4.